Supervised vs. Unsupervised Analysis

To compare supervised and unsupervise analyses is like comparing The Beatles and The Velvet Underground music. Both are good, just good in different ways. While supervised model is dominating in the modern research since it provides guaranteed but slow progress, the unsupervised strategy has always been considered as a “high risk high gain” approach.

Multi-OMICs Factor Analysis (MOFA)

MOFA is a typical hypothesis-free data exploration framework, https://www.embopress.org/doi/10.15252/msb.20178124. It allows data integration by extracting common axes of variation from the multiple OMICs layers. Given several data matrices with measurements of multiple ‘omics data types on the same or on overlapping sets of samples, MOFA infers a low-dimensional data representation in terms of (hidden) factors which can be then visualized and interpreted.

Importantly, MOFA disentangles whether the underlying axes of heterogeneity are unique to a single data modality or are manifested in multiple modalities, thereby identifying links between the different ‘omics. Once trained, the model output can be used for a range of downstream analyses, including

How does a typical MOFA output look like? In the MOFA paper they applied this method to Chronic Lymphocytic Leukaemia (CLL) on 200 human patients that combined 1) drug response, 2) somatic mutations (targeted sequencing), 3) RNAseq data, and 4) Methylation array data.

Notably, nearly 40% of the 200 samples were profiled with some but not all OMICs types; such a missing value scenario is not uncommon in large cohort studies, and MOFA is designed to cope with it.

One can inspect loadings of MOFA hidden factors and observe known bio-markers fo CLL. Specifically, factor 1 was aligned with the somatic mutation status of the immunoglobulin heavy-chain variable region gene (IGHV), while Factor 2 aligned with trisomy of chromosome 12.

Loadings from each factor can be used for pathway enrichment analysis. Interestingly, markers from diffeerent OMICs co-varying in a certain MOFA factor can be shown to be coupled to the same biological pathway. For CLL data, factor 5 demonstrates co-variation between drug response and mRNA levels. This factor tagged a set of genes that were highly enriched for oxidative stress and senescence pathways. Consistent with the annotation based on the mRNA view, it is observed that the drugs with the strongest weights on Factor 5 were associated with response to oxidative stress, such as target reactive oxygen species, DNA damage response and apoptosis.

Next MOFA can perform efficient imputation of missing values including imputation of “missing patients”, i.e. when a patient is present in one OMICs layer but absent in another, this absent layer can be reconstructed for the patient, i.e. gray bars in the panel a) of the figure above can be filled.

Finally, latent factors inferred by MOFA are predictive of clinical outcomes. The authors of MOFa paper assessed the prediction performance when combining the 10 MOFA factors in a multivariate Cox regression model (assessed using cross-validation). Notably, this model yielded higher prediction accuracy than the 10 factors derived from conventional PCA or using the individual features.

Important to realize that despite the results of PCA and Factor Analysis visually look very similar, as they both provide linear variance-covariance matrix decomposition, they have quite a different math behind. While PCA is a pure matrix factorization technique which splits the total variance into orthogonal Principal Components (PCs), Factor Analysis seeks to construct hidden latent variables that generate the observed data, therefore Factor Analysis is a generative model.

 

Prepare scNMT Data Set for MOFA

In this section we will apply MOFA to the single cell multi-OMICs data set scNMT which profiles chromatine accessebility (scATACseq), DNA methylation (scBSseq) and gene expression (scRNAseq) information from the same biological cell belonging to two types: differentiating mouse embryonic stem cells (ESCs) and embrionic bodies (EBs). We will start with reading and having a look at the individual OMICs data sets:

scRNAseq<-read.delim("scRNAseq.txt", header = TRUE, check.names = FALSE, row.names = 1, sep="\t")
ens2genesymbol<-read.delim("ENSEMBLE_TO_GENE_SYMBOL_MOUSE.txt")
ens2genesymbol<-ens2genesymbol[match(colnames(scRNAseq),as.character(ens2genesymbol$ensembl_gene_id)),]
colnames(scRNAseq)<-ens2genesymbol$external_gene_name
scRNAseq<-as.data.frame(t(scRNAseq))
scRNAseq[1:5,1:5]
##          ESC_A02  ESC_A03  ESC_A04  ESC_A05  ESC_A06
## Mrpl15  557.4201 130.8536 305.7356 417.1656 573.6504
## Lypla1  410.5958 138.8325 186.3237 106.0895 662.7008
## Tcea1   437.9119 356.3899 276.9121 161.8315 213.3068
## Atp6v1h 345.7199 500.5417 195.5884 133.0614 325.1376
## Oprk1     0.0000   0.0000   0.0000   0.0000   0.0000
scBSseq<-read.delim("scBSseq.txt", header = TRUE, check.names = FALSE, row.names = 1, sep="\t")
scBSseq<-as.data.frame(t(scBSseq))
scBSseq[1:5,1:5]
##               ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_101606896   0.00000       0       0       0       0
## 1_101935054  88.88889     100     100     100      30
## 1_101935061  88.88889     100     100     100      90
## 1_102002184 100.00000     100       0       0       0
## 1_102238204 100.00000     100     100     100     100
scATACseq<-read.delim("scATACseq.txt", header = TRUE, check.names = FALSE, row.names = 1, sep="\t")
scATACseq<-as.data.frame(t(scATACseq))
scATACseq[1:5,1:5]
##             ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_100392139     100     100     100     100      67
## 1_100392151     100     100     100      50      67
## 1_100668590      50      86      44      45      28
## 1_100738967      83      88      83      83      88
## 1_100994324       0       0      11      75       0

For scRNAseq OMIC layer we will only select highly expressed genes in order to remove the noisy genes which might contaminate the further downstream analysis. We will also perform log-transform of the data which can be seen as a mild normalization:

scRNAseq<-scRNAseq[rowMeans(scRNAseq)>=1,]
scRNAseq<-log10(scRNAseq + 1)
scRNAseq[1:5,1:5]
##          ESC_A02  ESC_A03  ESC_A04  ESC_A05  ESC_A06
## Mrpl15  2.746961 2.120092 2.486764 2.621348 2.759404
## Lypla1  2.614471 2.145608 2.272593 2.029747 2.821972
## Tcea1   2.642377 2.553142 2.443907 2.211738 2.331036
## Atp6v1h 2.539979 2.700307 2.293558 2.127304 2.513401
## Oprk1   0.000000 0.000000 0.000000 0.000000 0.000000
dim(scRNAseq)
## [1] 12145   113

Since scBSseq and scATACseq OMIC layers should be almost a binary (methylated vs unmethylated for scBSseq and open vs. close for scATACseq) data sets, a good way to get rid of redundancy in the scBSseq and scATACseq data and thus perform feature pre-selection and reduce dimensions is to select only features with high variability:

library("mixOmics")
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Loaded mixOmics 6.3.1
## 
## Visit http://www.mixOmics.org for more details about our methods.
## Any bug reports or comments? Notify us at mixomics at math.univ-toulouse.fr or https://bitbucket.org/klecao/package-mixomics/issues
## 
## Thank you for using mixOmics!
my_nearZeroVar<-nearZeroVar(as.data.frame(t(scBSseq)))
head(my_nearZeroVar$Metrics)
##             freqRatio percentUnique
## 1_101606896    112.00     1.7699115
## 1_105525627     35.00     5.3097345
## 1_105539120     26.25     5.3097345
## 1_10605572       0.00     0.8849558
## 1_107485472      0.00     0.8849558
## 1_109269874     20.00     7.9646018
dim(my_nearZeroVar$Metrics)
## [1] 3290    2
scBSseq <- scBSseq[-which(rownames(scBSseq)%in%rownames(my_nearZeroVar$Metrics)),]
scBSseq[1:5,1:5]
##               ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_101935054  88.88889     100     100     100      30
## 1_101935061  88.88889     100     100     100      90
## 1_102002184 100.00000     100       0       0       0
## 1_102238204 100.00000     100     100     100     100
## 1_102255678 100.00000     100     100     100       0
dim(scBSseq)
## [1] 5285  113
my_nearZeroVar<-nearZeroVar(as.data.frame(t(scATACseq)), uniqueCut = 1)
head(my_nearZeroVar$Metrics)
##             freqRatio percentUnique
## 1_10401731          0     0.8849558
## 1_128314944         0     0.8849558
## 1_13628215          0     0.8849558
## 1_178804829         0     0.8849558
## 1_183332775         0     0.8849558
## 1_183944178         0     0.8849558
dim(my_nearZeroVar$Metrics)
## [1] 91  2
scATACseq <- scATACseq[-which(rownames(scATACseq)%in%rownames(my_nearZeroVar$Metrics)),]
scATACseq[1:5,1:5]
##             ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_100392139     100     100     100     100      67
## 1_100392151     100     100     100      50      67
## 1_100668590      50      86      44      45      28
## 1_100738967      83      88      83      83      88
## 1_100994324       0       0      11      75       0
dim(scATACseq)
## [1] 11709   113

Let us now have a look at the histograms of individual OMICs layers in order to decide which distribution they follow and how we should model these distributions with MOFA:

hist(rowMeans(scRNAseq),breaks=100, main = "scRNAseq")

hist(rowMeans(scBSseq), breaks = 100, main = "scBSseq")

hist(rowMeans(scATACseq), breaks = 100, main = "scATACseq")

We conclude that while scRNAseq data look fairly Gaussian, we should probably model the scBSseq and scATACseq data as following Bernoulli distribution as they look quite bimodal indicating the binary nature of the data, i.e. methylated vs. unmethylated for scBSseq and open vs. close for scATACseq. We will further make the scBSseq and scATACseq data sets purely binary by encoding values below 50 as 0 and above 50 as 1, we need to remove low-variance features in this case again:

scBSseq<-ifelse(scBSseq<50,0,1)
scBSseq[1:5,1:5]
##             ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_101935054       1       1       1       1       0
## 1_101935061       1       1       1       1       1
## 1_102002184       1       1       0       0       0
## 1_102238204       1       1       1       1       1
## 1_102255678       1       1       1       1       0
my_nearZeroVar<-nearZeroVar(as.data.frame(t(scBSseq)))
head(my_nearZeroVar$Metrics)
##             freqRatio percentUnique
## 1_110547534     21.60     1.7699115
## 1_133172191     27.25     1.7699115
## 1_139117169     27.25     1.7699115
## 1_146093078     21.60     1.7699115
## 1_194260338    112.00     1.7699115
## 1_24611678       0.00     0.8849558
dim(my_nearZeroVar$Metrics)
## [1] 761   2
scBSseq <- as.data.frame(scBSseq[-which(rownames(scBSseq)%in%rownames(my_nearZeroVar$Metrics)),])
scBSseq[1:5,1:5]
##             ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_101935054       1       1       1       1       0
## 1_101935061       1       1       1       1       1
## 1_102002184       1       1       0       0       0
## 1_102238204       1       1       1       1       1
## 1_102255678       1       1       1       1       0
dim(scBSseq)
## [1] 4524  113
scATACseq<-ifelse(scATACseq<50,0,1)
my_nearZeroVar<-nearZeroVar(as.data.frame(t(scATACseq)))
head(my_nearZeroVar$Metrics)
##             freqRatio percentUnique
## 1_100392139    112.00      1.769912
## 1_100392151     27.25      1.769912
## 1_100738967     55.50      1.769912
## 1_101897864     55.50      1.769912
## 1_102283335     55.50      1.769912
## 1_102563492    112.00      1.769912
dim(my_nearZeroVar$Metrics)
## [1] 2238    2
scATACseq <- as.data.frame(scATACseq[-which(rownames(scATACseq)%in%rownames(my_nearZeroVar$Metrics)),])
scATACseq[1:5,1:5]
##             ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_100668590       1       1       0       0       0
## 1_100994324       0       0       0       1       0
## 1_100994328       0       0       0       0       0
## 1_100994336       0       0       0       0       0
## 1_100994343       0       0       0       0       1
dim(scATACseq)
## [1] 9471  113

Run MOFA on scNMT Data Set

Let us continue with creating MOFA object:

library("MOFA")
## 
## Attaching package: 'MOFA'
## The following object is masked from 'package:stats':
## 
##     predict
omics<-list(scRNAseq=scRNAseq,scBSseq=scBSseq,scATACseq=scATACseq)
lapply(omics,dim)
## $scRNAseq
## [1] 12145   113
## 
## $scBSseq
## [1] 4524  113
## 
## $scATACseq
## [1] 9471  113
MOFAobject <- createMOFAobject(omics)
## Creating MOFA object from list of matrices,
##  please make sure that samples are columns and features are rows...
plotDataOverview(MOFAobject)

#DEFINE DATA OPTIONS
DataOptions <- getDefaultDataOptions()
DataOptions
## $scaleViews
## [1] FALSE
## 
## $removeIncompleteSamples
## [1] FALSE
#DEFINE MODEL OPTIONS
ModelOptions <- getDefaultModelOptions(MOFAobject)
mydistr<-c("gaussian","bernoulli","bernoulli")
names(mydistr)<-c("scRNAseq","scBSseq","scATACseq")
ModelOptions$likelihood<-mydistr
ModelOptions$numFactors<-20
ModelOptions
## $likelihood
##    scRNAseq     scBSseq   scATACseq 
##  "gaussian" "bernoulli" "bernoulli" 
## 
## $numFactors
## [1] 20
## 
## $sparsity
## [1] TRUE
#DEFINE TRAIN OPTIONS
TrainOptions <- getDefaultTrainOptions()
TrainOptions$seed <- 2018
# Automatically drop factors that explain less than 1% of variance in all omics
TrainOptions$DropFactorThreshold <- 0.03
TrainOptions$tolerance <- 0.1
TrainOptions$maxiter<-1000
TrainOptions$verbose<-TRUE
TrainOptions
## $maxiter
## [1] 1000
## 
## $tolerance
## [1] 0.1
## 
## $DropFactorThreshold
## [1] 0.03
## 
## $verbose
## [1] TRUE
## 
## $seed
## [1] 2018

Let us run MOFA:

MOFAobject <- prepareMOFA(MOFAobject, DataOptions = DataOptions, 
                          ModelOptions = ModelOptions, TrainOptions = TrainOptions)
## Checking data options...
## Checking training options...
## Checking model options...
MOFAobject <- runMOFA(MOFAobject)
## [1] "No output file provided, using a temporary file..."

Let us check the covariance of the OMICs layers:

#ANALYZE RESULTS OF MOFA INTEGRATION
r2 <- calculateVarianceExplained(MOFAobject)
r2$R2Total
##  scATACseq    scBSseq   scRNAseq 
## 0.04628196 0.05484257 0.13215154
head(r2$R2PerFactor)
##        scATACseq      scBSseq   scRNAseq
## LF1 0.0458775605 1.482963e-03 0.06899144
## LF2 0.0003573541 5.366988e-02 0.02739404
## LF3 0.0001293258 7.427593e-05 0.03848415
plotVarianceExplained(MOFAobject)

We can see that scRNAseq provides the largest variation (13%) in the total integrative OMICs scNMT data set, scBSseq and scATACseq contribute around 5% of variation. We also observe that MOFA selected 3 Latent Factors (LFs), scRNAseq contributes to all of them while second LF is driven by scBSseq and scATACseq contributes only to the first LF, scRNAseq covaries with scATACseq in the first LF while scRNAseq covary with scBSseq in the second LF. Now let us interpret the LFs by displaying the features associated with each LF:

NumFactors<-dim(getFactors(MOFAobject))[2]

plotWeights(MOFAobject, view = "scRNAseq", factor = 1)

plotTopWeights(object = MOFAobject, view = "scRNAseq", factor = 1, nfeatures = 10)

plotDataHeatmap(object = MOFAobject, view = "scRNAseq", factor = "LF1", features = 10, transpose = TRUE, show_colnames=TRUE, show_rownames=TRUE,cluster_cols = TRUE)

plotWeights(MOFAobject, view = "scBSseq", factor = 2)

plotTopWeights(object = MOFAobject, view = "scBSseq", factor = 2, nfeatures = 10)

plotDataHeatmap(object = MOFAobject, view = "scBSseq", factor = "LF2", features = 10, transpose = TRUE, show_colnames=TRUE, show_rownames=TRUE,cluster_cols = TRUE)

plotWeights(MOFAobject, view = "scATACseq", factor = 1)

plotTopWeights(object = MOFAobject, view = "scATACseq", factor = 1, nfeatures = 10)

plotDataHeatmap(object = MOFAobject, view = "scATACseq", factor = "LF1", features = 10, transpose = TRUE, show_colnames=TRUE, show_rownames=TRUE,cluster_cols = TRUE)

Let us also display the cells in the low-dimensional latent space:

controls<-c("EB_P1D12","EB_P1E12","EB_P1F12","EB_P1G12","EB_P2B12","EB_P2D12","EB_P2E12","EB_P2F12","EB_P2G12",
            "ESC_B12","ESC_C12","ESC_D12")
colnames(scRNAseq)<-ifelse(colnames(scRNAseq)%in%controls,"CONTROL_CELL",colnames(scRNAseq))
cell_types<-matrix(unlist(strsplit(colnames(scRNAseq),"_")),ncol=2,byrow=TRUE)[,1]

plotFactorScatter(MOFAobject, factors = 1:2, color_by = as.factor(cell_types))

We conclude that the integrative OMICs approach can distinguish between all three types of cells: ESCs, EBs and Controls.